A Hybrid Approach for Entity Extraction in Code-Mixed Social Media Data
Abstract
Entity extraction is an important task in many natural language processing (NLP) applications. A significant body of work addresses entity extraction, but mostly for a few languages (such as English, some European languages and a few Asian languages) and for domains such as newswire. Social media have become a convenient and powerful way to express one's opinions and sentiments. India is a diverse country with many linguistic and cultural variations. Texts written on social media are informal in nature, and people often use more than one script while writing. User-generated content such as tweets, blogs and personal websites is written in Roman script, or sometimes in both Roman and indigenous scripts. Entity extraction is, in general, more challenging for such informal text, and code mixing further complicates the process. In this paper, we propose a hybrid approach for entity extraction from code-mixed language pairs such as English-Hindi and English-Tamil. We use a rich linguistic feature set to train a Conditional Random Field (CRF) classifier. The output of the classifier is post-processed with a carefully hand-crafted feature set. The proposed system achieves F-scores of 62.17% and 44.12% for the English-Hindi and English-Tamil language pairs, respectively. Our system attains the best F-score among all the systems submitted to the FIRE 2016 shared task for the English-Tamil language pair.
CCS Concepts
• Computing methodologies → Natural Language Processing; • Information System → Information Extraction; • Algorithm → Conditional Random Field (CRF);
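As a rough illustration of the pipeline the abstract describes (token-level linguistic features feeding a CRF, followed by hand-crafted post-processing), the sketch below builds per-token feature dictionaries and applies one rule-based correction. It is a minimal sketch, not the authors' system: the specific features, the `@`-mention rule, the label names (`B-PERSON`, `B-LOCATION`, `O`) and the function names are all assumptions chosen for illustration, and the CRF training step itself is omitted.

```python
def token_features(tokens, i):
    """Build a feature dict for tokens[i]. All features here are illustrative
    assumptions, not the feature set used in the paper."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "prefix3": tok[:3].lower(),
        "suffix3": tok[-3:].lower(),
        "is_capitalized": tok[:1].isupper(),
        "has_digit": any(c.isdigit() for c in tok),
        "is_hashtag": tok.startswith("#"),
        "is_mention": tok.startswith("@"),
        # True when the token contains any non-ASCII character,
        # e.g. Devanagari or Tamil script mixed into Roman text.
        "non_roman": any(ord(c) > 127 for c in tok),
        # Neighbouring words give the sequence model local context.
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


def sentence_features(tokens):
    """One feature dict per token, in order."""
    return [token_features(tokens, i) for i in range(len(tokens))]


def postprocess(tokens, labels):
    """Hypothetical hand-written rule applied to classifier output:
    an @mention that was left as "O" is relabelled as a person."""
    fixed = list(labels)
    for i, tok in enumerate(tokens):
        if tok.startswith("@") and len(tok) > 1 and fixed[i] == "O":
            fixed[i] = "B-PERSON"
    return fixed


if __name__ == "__main__":
    tweet = ["@rahul", "Mumbai", "mein", "hai"]
    print(sentence_features(tweet)[1])
    # Pretend the classifier missed the mention; the rule repairs it.
    print(postprocess(tweet, ["O", "B-LOCATION", "O", "O"]))
```

The list-of-dictionaries output is the kind of input that common CRF toolkits accept for training; the post-processing step runs separately on the predicted label sequence.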
Similar resources
Named Entity Recognition for Code Mixing in Indian Languages using Hybrid Approach
Automating the process of Named Entity Recognition in social media text has received a lot of attention over the past few years. Named Entities are real-world objects such as Person, Organization, Product, Location. Identifying these entities in social media text is an important and challenging task due to the informal nature of text present on social media. One such challenge that is faced in recognizing...
Conditional Random Fields for Code Mixed Entity Recognition
Entity Recognition is an essential part of Information Extraction, where explicitly available information and relations are extracted from the entities within the text. A plethora of information is available in social media in the form of text and, due to its free-style representation, it introduces much complexity while mining information out of it. This complexity is enhanced more by r...
A Novel Approach to Conditional Random Field-based Named Entity Recognition using Persian Specific Features
Named Entity Recognition is an information extraction technique that identifies named entities in a text. Three methods have conventionally been used to extract named entities from a text: rule-based, machine-learning-based, and a hybrid of the two. Machine-learning-based methods have good performance in the Persian language if they are trained with good features. To get good performanc...
Sentiment Identification in Code-Mixed Social Media Text
Sentiment analysis is the Natural Language Processing (NLP) task dealing with the detection and classification of sentiments in texts. While some tasks deal with identifying presence of sentiment in text (Subjectivity analysis), other tasks aim at determining the polarity of the text categorizing them as positive, negative and neutral. Whenever there is presence of sentiment in text, it has a s...
AMRITA_CEN@FIRE 2016: Code-Mix Entity Extraction for Hindi-English and Tamil-English Tweets
Social media text holds information regarding various important aspects. Extraction of such information serves as the basis for the most preliminary task in Natural Language Processing, called Entity extraction. The work is submitted as a part of the Shared task on Code Mix Entity Extraction for Indian Languages (CMEE-IL) at the Forum for Information Retrieval Evaluation (FIRE) 2016. Three different meth...
Publication date: 2016